Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

Disk-based k-mer counting on a PC

Identifieur interne : 002010 ( Main/Exploration ); précédent : 002009; suivant : 002011

Disk-based k-mer counting on a PC

Auteurs : Sebastian Deorowicz [Pologne] ; Agnieszka Debudaj-Grabysz [Pologne] ; Szymon Grabowski [Pologne]

Source :

RBID : PMC:3680041

Descripteurs français

English descriptors

Abstract

Background

The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.

Results

We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.

Conclusions

By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.


Url:
DOI: 10.1186/1471-2105-14-160
PubMed: 23679007
PubMed Central: 3680041


Affiliations:


Links toward previous steps (curation, corpus...)


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Disk-based
<italic>k</italic>
-mer counting on a PC</title>
<author>
<name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland</nlm:aff>
<country xml:lang="fr">Pologne</country>
<wicri:regionArea>, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100</wicri:regionArea>
<wicri:noRegion>44-100</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Debudaj Grabysz, Agnieszka" sort="Debudaj Grabysz, Agnieszka" uniqKey="Debudaj Grabysz A" first="Agnieszka" last="Debudaj-Grabysz">Agnieszka Debudaj-Grabysz</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland</nlm:aff>
<country xml:lang="fr">Pologne</country>
<wicri:regionArea>, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100</wicri:regionArea>
<wicri:noRegion>44-100</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
<affiliation wicri:level="1">
<nlm:aff id="I2">, Institute of Computer Science, Lodz University of Technology, Al. Politechniki 11, Łódź, 90-924, Poland</nlm:aff>
<country xml:lang="fr">Pologne</country>
<wicri:regionArea>, Institute of Computer Science, Lodz University of Technology, Al. Politechniki 11, Łódź, 90-924</wicri:regionArea>
<wicri:noRegion>90-924</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23679007</idno>
<idno type="pmc">3680041</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3680041</idno>
<idno type="RBID">PMC:3680041</idno>
<idno type="doi">10.1186/1471-2105-14-160</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">000945</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000945</idno>
<idno type="wicri:Area/Pmc/Curation">000945</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000945</idno>
<idno type="wicri:Area/Pmc/Checkpoint">001236</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">001236</idno>
<idno type="wicri:source">PubMed</idno>
<idno type="RBID">pubmed:23679007</idno>
<idno type="wicri:Area/PubMed/Corpus">001C93</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001C93</idno>
<idno type="wicri:Area/PubMed/Curation">001C93</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">001C93</idno>
<idno type="wicri:Area/PubMed/Checkpoint">001B60</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">001B60</idno>
<idno type="wicri:Area/Ncbi/Merge">000A48</idno>
<idno type="wicri:Area/Ncbi/Curation">000A48</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000A48</idno>
<idno type="wicri:Area/Main/Merge">002031</idno>
<idno type="wicri:Area/Main/Curation">002010</idno>
<idno type="wicri:Area/Main/Exploration">002010</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Disk-based
<italic>k</italic>
-mer counting on a PC</title>
<author>
<name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland</nlm:aff>
<country xml:lang="fr">Pologne</country>
<wicri:regionArea>, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100</wicri:regionArea>
<wicri:noRegion>44-100</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Debudaj Grabysz, Agnieszka" sort="Debudaj Grabysz, Agnieszka" uniqKey="Debudaj Grabysz A" first="Agnieszka" last="Debudaj-Grabysz">Agnieszka Debudaj-Grabysz</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100, Poland</nlm:aff>
<country xml:lang="fr">Pologne</country>
<wicri:regionArea>, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice, 44-100</wicri:regionArea>
<wicri:noRegion>44-100</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
<affiliation wicri:level="1">
<nlm:aff id="I2">, Institute of Computer Science, Lodz University of Technology, Al. Politechniki 11, Łódź, 90-924, Poland</nlm:aff>
<country xml:lang="fr">Pologne</country>
<wicri:regionArea>, Institute of Computer Science, Lodz University of Technology, Al. Politechniki 11, Łódź, 90-924</wicri:regionArea>
<wicri:noRegion>90-924</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Animals</term>
<term>Caenorhabditis elegans (genetics)</term>
<term>Genome, Human</term>
<term>Genomics (methods)</term>
<term>Humans</term>
<term>Microcomputers</term>
<term>Sequence Alignment</term>
<term>Sequence Analysis, DNA</term>
<term>Software</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Algorithmes</term>
<term>Alignement de séquences</term>
<term>Analyse de séquence d'ADN</term>
<term>Animaux</term>
<term>Caenorhabditis elegans (génétique)</term>
<term>Génome humain</term>
<term>Génomique ()</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Micro-ordinateurs</term>
</keywords>
<keywords scheme="MESH" qualifier="genetics" xml:lang="en">
<term>Caenorhabditis elegans</term>
</keywords>
<keywords scheme="MESH" qualifier="génétique" xml:lang="fr">
<term>Caenorhabditis elegans</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Genomics</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Animals</term>
<term>Genome, Human</term>
<term>Humans</term>
<term>Microcomputers</term>
<term>Sequence Alignment</term>
<term>Sequence Analysis, DNA</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Algorithmes</term>
<term>Alignement de séquences</term>
<term>Analyse de séquence d'ADN</term>
<term>Animaux</term>
<term>Génome humain</term>
<term>Génomique</term>
<term>Humains</term>
<term>Logiciel</term>
<term>Micro-ordinateurs</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>The
<italic>k</italic>
-mer counting problem, which is to build the histogram of occurrences of every
<italic>k</italic>
-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.</p>
</sec>
<sec>
<title>Results</title>
<p>We propose a simple, yet efficient, parallel disk-based algorithm for counting
<italic>k</italic>
-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive
<italic>k</italic>
-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Compeau, Pe" uniqKey="Compeau P">PE Compeau</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author>
<name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kurtz, S" uniqKey="Kurtz S">S Kurtz</name>
</author>
<author>
<name sortKey="Narechania, A" uniqKey="Narechania A">A Narechania</name>
</author>
<author>
<name sortKey="Stein, J" uniqKey="Stein J">J Stein</name>
</author>
<author>
<name sortKey="Ware, D" uniqKey="Ware D">D Ware</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, Jr" uniqKey="Miller J">JR Miller</name>
</author>
<author>
<name sortKey="Delcher, Al" uniqKey="Delcher A">AL Delcher</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author>
<name sortKey="Venter, E" uniqKey="Venter E">E Venter</name>
</author>
<author>
<name sortKey="Walenz, B" uniqKey="Walenz B">B Walenz</name>
</author>
<author>
<name sortKey="Brownley, A" uniqKey="Brownley A">A Brownley</name>
</author>
<author>
<name sortKey="Johnson, J" uniqKey="Johnson J">J Johnson</name>
</author>
<author>
<name sortKey="Li, K" uniqKey="Li K">K Li</name>
</author>
<author>
<name sortKey="Mobarry, Cm" uniqKey="Mobarry C">CM Mobarry</name>
</author>
<author>
<name sortKey="Sutton, Gg" uniqKey="Sutton G">GG Sutton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcais, G" uniqKey="Marcais G">G Marçais</name>
</author>
<author>
<name sortKey="Kingsford, C" uniqKey="Kingsford C">C Kingsford</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Melsted, P" uniqKey="Melsted P">P Melsted</name>
</author>
<author>
<name sortKey="Pritchard, Jk" uniqKey="Pritchard J">JK Pritchard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pell, J" uniqKey="Pell J">J Pell</name>
</author>
<author>
<name sortKey="Hintze, A" uniqKey="Hintze A">A Hintze</name>
</author>
<author>
<name sortKey="Canino Koning, R" uniqKey="Canino Koning R">R Canino-Koning</name>
</author>
<author>
<name sortKey="Howe, A" uniqKey="Howe A">A Howe</name>
</author>
<author>
<name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
<author>
<name sortKey="Brown, Ct" uniqKey="Brown C">CT Brown</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rizk, G" uniqKey="Rizk G">G Rizk</name>
</author>
<author>
<name sortKey="Lavenier, D" uniqKey="Lavenier D">D Lavenier</name>
</author>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Satish, N" uniqKey="Satish N">N Satish</name>
</author>
<author>
<name sortKey="Kim, C" uniqKey="Kim C">C Kim</name>
</author>
<author>
<name sortKey="Chhugani, J" uniqKey="Chhugani J">J Chhugani</name>
</author>
<author>
<name sortKey="Nguyen, Ad" uniqKey="Nguyen A">AD Nguyen</name>
</author>
<author>
<name sortKey="Lee, Vw" uniqKey="Lee V">VW Lee</name>
</author>
<author>
<name sortKey="Kim, D" uniqKey="Kim D">D Kim</name>
</author>
<author>
<name sortKey="Dubey, P" uniqKey="Dubey P">P Dubey</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dean, J" uniqKey="Dean J">J Dean</name>
</author>
<author>
<name sortKey="Ghemawat, S" uniqKey="Ghemawat S">S Ghemawat</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<affiliations>
<list>
<country>
<li>Pologne</li>
</country>
</list>
<tree>
<country name="Pologne">
<noRegion>
<name sortKey="Deorowicz, Sebastian" sort="Deorowicz, Sebastian" uniqKey="Deorowicz S" first="Sebastian" last="Deorowicz">Sebastian Deorowicz</name>
</noRegion>
<name sortKey="Debudaj Grabysz, Agnieszka" sort="Debudaj Grabysz, Agnieszka" uniqKey="Debudaj Grabysz A" first="Agnieszka" last="Debudaj-Grabysz">Agnieszka Debudaj-Grabysz</name>
<name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002010 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002010 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     PMC:3680041
   |texte=   Disk-based k-mer counting on a PC
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:23679007" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021